A data lake is a storage repository that holds a large amount of raw data in its native format until the data is needed for processing. Data lakes typically store unstructured data, but they can combine data of different kinds. The data lake is part of the Collector component, as it stores the raw data received from the data sources; for that reason, it must be aligned with the defined requirements. Moreover, a data lake is easier to manage when metadata is attached to the stored items, which mitigates the problem of holding a large amount of disorganized data.
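As a minimal sketch of this metadata-based management idea (the function name, directory layout, and metadata fields are illustrative assumptions, not part of any cited system), an ingestion step can store each raw payload unmodified and write a JSON metadata sidecar next to it:

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def ingest(lake_root: str, source: str, payload: bytes, content_type: str) -> Path:
    """Store a raw payload as-is and write a JSON metadata sidecar next to it.

    Hypothetical helper: keys raw files by content hash so duplicates collapse,
    and records provenance metadata for later discovery and organization.
    """
    root = Path(lake_root) / source
    root.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(payload).hexdigest()
    data_path = root / digest          # raw data kept untouched, keyed by content hash
    data_path.write_bytes(payload)
    metadata = {
        "source": source,              # which data source produced the payload
        "content_type": content_type,  # hint for consumers; the data itself stays raw
        "size_bytes": len(payload),
        "sha256": digest,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    (root / f"{digest}.meta.json").write_text(json.dumps(metadata, indent=2))
    return data_path

# Usage: ingest a raw sensor reading without processing it
stored = ingest("/tmp/lake", "sensor-a", b'{"temp": 21.5}', "application/json")
```

The payload itself is never parsed or transformed at ingestion time; only the sidecar metadata is structured, which is what later makes the raw collection searchable.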
References:
- N. Miloslavskaya and A. Tolstoy, 'Application of big data, fast data, and data lake concepts to information security issues', presented at the Proceedings - 2016 4th International Conference on Future Internet of Things and Cloud Workshops, W-FiCloud 2016, 2016, pp. 148-153.
- C. Diamantini, P. L. Giudice, L. Musarella, D. Potena, E. Storti, and D. Ursino, 'A new metadata model to uniformly handle heterogeneous data lake sources', Commun. Comput. Inf. Sci., vol. 909, pp. 165-177, 2018.